Implementation and testing of an automated EST processing and similarity analysis system

Authors

  • Elizabeth Shoop
  • Ed Huai-hsin Chi
  • John V. Carlis
  • Paul Bieganski
  • John Riedl
  • Neal Dalton
  • Thomas Newman
  • Ernest F. Retzel
Abstract

Expressed sequence tag (EST) sequencing projects are being undertaken in an effort to identify the function of as many genes as possible from entire genomes. Putative function can be determined by analyzing the similarity of the ESTs to sequences in the public databases. We are involved in a long-term project to research and develop database technology to store and analyze ESTs for Arabidopsis thaliana. The massive number of ESTs being produced through automated sequencing technologies necessitates automated processing and similarity analysis of the ESTs. This paper describes a complete software system that takes ESTs from a sequencing machine, analyzes them for quality, and searches public databases of previously known sequences. Automating the processing and analysis of the several thousand ESTs produced to date by the Michigan State University Arabidopsis cDNA Sequencing Project has improved the quality of the EST data and the speed at which ESTs can be entered in the public databases. Additional searches that compensate for low complexity regions in the ESTs allowed for more accurate review of the similarity results. The results of the similarity searches are packaged into summarized similarity hit information files with an indication of whether it is possible to identify the ESTs. All processed ESTs and their similarity analyses are available through a Mosaic server, which includes a parsed presentation of the search results and a three-dimensional graphical display of the similarities found. Automating searches in the public databases will make it possible to store putative functional relationships in a database system, in addition to the sequences. An extensible database management system we are developing on a commercial platform will store the data and search results. This will enable biologists to conduct ad hoc exploration using a high-level query language.
1.0 Introduction

We are engaged in a long-term research project to provide a database system to support the collection, analysis, and storage of data from cDNA sequencing projects. The short-term objective of this project is the acquisition of EST data. The goal of EST sequencing projects is to sequence enough short sections of expressed cDNA sequence (expressed sequence tags, or ESTs) to obtain the functional expression units of an entire genome, to identify the function of as many genes as possible, and to discover novel genes. The method used to identify each EST and infer the possible function of its corresponding protein is to conduct sequence similarity searches against the public databanks of known DNA and protein sequences [McCo92, Adam92]. The common way to do this is to run the BLAST similarity search programs [Alts90] for each EST and find "hits" to known sequences, where hits are regions in the EST having a certain degree of similarity to regions in the known sequences. Hits from EST similarity searches enable inference of probable biological function (often referred to as putative function), whereas a lack of hits implies the possibility of a novel gene discovery.

Researchers on EST sequencing projects are producing sequences at such a high rate that manual processing and similarity analysis is virtually impossible. We therefore seek to build a software system that automates each phase of their projects. These phases include processing of the raw sequences, making similarity results available to the rest of the user community, and identifying the putative function of each EST. We report here on the implementation and testing of a system that extends previous systems towards the ultimate goal of automated putative function determination. This system consists of these major components:

  • A single processing program that is invoked on a set of raw EST sequences. This program sets in motion a series of modules that: 1) check each sequence for quality, trim each sequence to an acceptable level of quality if necessary, and reject low quality ones; 2) trim off the leading vector sequence on each acceptable EST and reject clones that are all vector and have no insert; 3) run blastx and blastn on each acceptable EST; and 4) translate and check each reading frame for low complexity regions, and run blastp when low complexity regions are found.
  • A program that displays the alignments for hits in graphical form.
  • A program that takes the result of this processing and similarity searching, parses the BLAST output, and creates: 1) a summary file for a set of ESTs, containing all the hits found from blastn, blastx, and blastp, including an indication of whether the hits are sufficient to possibly identify an EST, and 2) a comprehensive file for each EST, in hypertext markup language (HTML) format, containing all the results of the processing and similarity searches, including hypertext links between the related information and images from the graphical results display program, for viewing on an Internet Mosaic server.
  • A program that creates properly formatted files, for a set of acceptable ESTs, for submission to dbEST [Bogu93], which is the public point of submission for ESTs.

In addition to automating each part of EST sequencing projects, we are interested in providing to the community faster and more accurate ways of exploring and finding similarities of interest from the large set of similarity information that is accumulating. To this end, we have designed and begun implementing a database system for similarity results. Using a commercially available database management system with a standard high-level query language, queries in that language will better enable biolo-
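
The module sequence listed in the introduction (quality check, vector trimming, and the low-complexity test that triggers blastp) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The excerpt gives no thresholds or algorithms, so every constant and function name here is an assumption: quality is approximated by the fraction of ambiguous "N" base calls, the vector is a hypothetical fixed prefix, and low complexity is approximated by Shannon entropy over a sliding peptide window. The sketch only decides which BLAST programs would be run; it does not invoke them.

```python
from collections import Counter
from math import log2

# Illustrative assumptions only; none of these values come from the paper.
MAX_N_FRACTION = 0.03             # tolerated fraction of ambiguous base calls
MIN_INSERT_LENGTH = 20            # shortest sequence worth searching
VECTOR_PREFIX = "GAATTCGGCACGAG"  # hypothetical leading vector sequence

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def trim_for_quality(seq, max_n_fraction=MAX_N_FRACTION):
    """Keep the longest prefix whose fraction of N calls is acceptable."""
    seq = seq.upper()
    n_count, best_end = 0, 0
    for i, base in enumerate(seq, start=1):
        n_count += base == "N"
        if n_count / i <= max_n_fraction:
            best_end = i
    return seq[:best_end].rstrip("N")


def trim_vector(seq, vector=VECTOR_PREFIX):
    """Remove the leading vector sequence when it is present."""
    return seq[len(vector):] if seq.startswith(vector) else seq


def translate(dna):
    """Translate one reading frame; codons containing N become X."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frame_translations(dna):
    """Translate the three forward and three reverse-complement frames."""
    reverse = dna.translate(COMPLEMENT)[::-1]
    return [translate(strand[offset:])
            for strand in (dna, reverse) for offset in range(3)]


def shannon_entropy(text):
    """Shannon entropy, in bits, of the symbol distribution in text."""
    return -sum((n / len(text)) * log2(n / len(text))
                for n in Counter(text).values())


def has_low_complexity(protein, window=12, min_entropy=2.2):
    """True if any full window of residues falls below the entropy cutoff."""
    return any(shannon_entropy(protein[s:s + window]) < min_entropy
               for s in range(len(protein) - window + 1))


def process_est(raw):
    """Run the quality, vector, and low-complexity steps on one raw EST."""
    trimmed = trim_for_quality(raw)
    if len(trimmed) < MIN_INSERT_LENGTH:
        return {"status": "rejected", "reason": "low quality"}
    insert = trim_vector(trimmed)
    if len(insert) < MIN_INSERT_LENGTH:
        return {"status": "rejected", "reason": "no usable insert"}
    searches = ["blastn", "blastx"]  # always run on an acceptable EST
    if any(has_low_complexity(p) for p in six_frame_translations(insert)):
        searches.append("blastp")    # extra search for low-complexity frames
    return {"status": "accepted", "insert": insert, "searches": searches}
```

For example, `process_est(VECTOR_PREFIX + "GCT" * 15)` accepts the EST and schedules blastp as well, because one reading frame translates to a run of alanine. A real pipeline would substitute its own quality measure, vector library, and low-complexity filter for these stand-ins.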




Publication date: 1995